## [1] "fixed.acidity" "volatile.acidity"
## [3] "citric.acid" "residual.sugar"
## [5] "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density"
## [9] "pH" "sulphates"
## [11] "alcohol" "quality"
## [13] "quality_lev_f" "fixed_to_volatile_acid_level"
## [15] "sugar_to_sulfates_ratio"
The dataset contains 4,898 white wines, with 11 variables quantifying the chemical properties of each wine. Comparing the number of rows and columns captured by R against the number of rows and columns in the CSV confirms that the entire file was read successfully. Wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). All columns except “quality” and the “quality_lev_f” factor contain decimal values.
## Rows: 4,898
## Columns: 15
## $ fixed.acidity <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, …
## $ volatile.acidity <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32…
## $ citric.acid <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16…
## $ residual.sugar <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.0…
## $ chlorides <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050…
## $ free.sulfur.dioxide <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, …
## $ total.sulfur.dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 13…
## $ density <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, …
## $ pH <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18…
## $ sulphates <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47…
## $ alcohol <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8…
## $ quality <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7…
## $ quality_lev_f <fct> above avg., above avg., above avg., abov…
## $ fixed_to_volatile_acid_level <dbl> 25.926, 21.000, 28.929, 31.304, 31.304, …
## $ sugar_to_sulfates_ratio <dbl> 46.000, 3.265, 15.682, 21.250, 21.250, 1…
There are nearly identical numbers of Below Average and Exceptional wines (183 and 180). The majority of wines fall within the Above Average quality category (3,078), followed by the Average category (1,457). The variables (excluding ratio variables) with the largest ranges are total.sulfur.dioxide, free.sulfur.dioxide and residual.sugar. Such wide ranges suggest that these variables are likely to contain outliers.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917
## Median :0.04300 Median : 34.00 Median :134.0 Median :0.9937
## Mean :0.04577 Mean : 35.31 Mean :138.4 Mean :0.9940
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961
## Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.180 Median :0.4700 Median :10.40 Median :6.000
## Mean :3.188 Mean :0.4898 Mean :10.51 Mean :5.878
## 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :3.820 Max. :1.0800 Max. :14.20 Max. :9.000
## quality_lev_f fixed_to_volatile_acid_level sugar_to_sulfates_ratio
## below avg. : 183 Min. : 5.545 Min. : 1.169
## avg. :1457 1st Qu.:20.628 1st Qu.: 3.571
## above avg. :3078 Median :26.071 Median :10.932
## exceptional: 180 Mean :27.657 Mean :13.732
## 3rd Qu.:33.000 3rd Qu.:21.244
## Max. :90.000 Max. :95.362
## [,1] [,2]
## fixed.acidity 3.80000 14.20000
## volatile.acidity 0.08000 1.10000
## citric.acid 0.00000 1.66000
## residual.sugar 0.60000 65.80000
## chlorides 0.00900 0.34600
## free.sulfur.dioxide 2.00000 289.00000
## total.sulfur.dioxide 9.00000 440.00000
## density 0.98711 1.03898
## pH 2.72000 3.82000
## sulphates 0.22000 1.08000
## alcohol 8.00000 14.20000
## quality 3.00000 9.00000
## fixed_to_volatile_acid_level 5.54500 90.00000
## sugar_to_sulfates_ratio 1.16900 95.36200
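The minimum/maximum matrix above has the shape that `range` applied to each numeric column would produce. The following is a minimal sketch of that idea, not necessarily the code used for this report. The `wine_demo` data frame is a hypothetical stand-in holding a few values copied from the glimpse output.

```r
# Toy stand-in for the wine data (hypothetical name; values copied from the
# first rows shown in the glimpse output).
wine_demo <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23),
  quality_lev_f    = factor(rep("above avg.", 4))
)

# Keep the numeric columns only, since range() is not meaningful for the factor.
numeric_cols <- wine_demo[sapply(wine_demo, is.numeric)]

# sapply() returns one column per variable; t() flips it to one row per variable
# with the minimum in column 1 and the maximum in column 2.
ranges <- t(sapply(numeric_cols, range))
print(ranges)
```

Dropping the factor column first keeps `range` from failing on non-numeric data.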
Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Fixed Acidity values lie between 6.3 and 7.3.
Transformed the data to remove the long tails and reveal a unimodal distribution with an additional isolated spike at 0.20. The middle 50% of the Volatile Acidity values lie between 0.21 and 0.32.
Transformed the data to remove the long tails and reveal a unimodal distribution with an additional isolated spike at 0.50. The middle 50% of the Citric Acid values lie between 0.27 and 0.39.
Transformed the data to remove the long tails and reveal what can best be described as a bimodal distribution with peaks around 1.25 and 1.75. The middle 50% of the Residual Sugar values lie between 1.70 and 9.90. Before removing the tails, there was a fairly even distribution for sugar values of 2 to 5, where the frequency was between 50 and 200.
Transformed the data to remove the long tails and reveal a unimodal distribution with a peak around 0.0425.
Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Free Sulfur Dioxide values lie between 23.0 and 46.0.
Transformed the data to remove most of the long tails and reveal a mostly unimodal distribution. The middle 50% of the Total Sulfur Dioxide values lie between 108.0 and 167.0.
Transformed the data to remove the long tails. The middle 50% of the Density values lie between 0.9917 and 0.9961.
Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the pH values lie between 3.09 and 3.28.
Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Sulfates values lie between 0.41 and 0.55.
Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Alcohol values lie between 9.50 and 11.40.
Transformed the data to remove the long tails and reveal a unimodal distribution. The majority of the Quality values lie between 5 and 7.
Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Fixed-To-Volatile Acid values lie between 20.628 and 33.000.
The plot reveals a long-tailed unimodal distribution. The middle 50% of the Residual Sugar To Sulfates Level values lie between 3.571 and 21.244.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
The dataset contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates” and “alcohol”), plus a “quality” rating. It’s worth noting that there were no wines in the data set with qualities of 0, 1, 2 or 10. The “Quality” attribute is a sensory output derived from the median of at least 3 evaluations made by wine experts, while all other attributes are physicochemical inputs. Excluding the Alcohol attribute, all variables have outliers above the upper end of their interquartile range.
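One way to check a claim about high outliers is to count the values above the upper Tukey fence (third quartile plus 1.5 times the interquartile range). This is a sketch under that assumption. The report does not state which outlier rule was applied, and `count_high_outliers` and `outlier_demo` are hypothetical names with made-up illustrative values.

```r
# Count values above the upper Tukey fence: Q3 + 1.5 * IQR.
# (Assumed rule; the report does not say which definition of outlier it used.)
count_high_outliers <- function(x) {
  q3 <- quantile(x, 0.75, names = FALSE)
  sum(x > q3 + 1.5 * IQR(x))
}

# Small illustrative sample, not the real data set.
outlier_demo <- data.frame(
  chlorides = c(0.036, 0.043, 0.045, 0.050, 0.044, 0.346),
  alcohol   = c(8.8, 9.5, 10.1, 9.9, 11.4, 12.0)
)

# One count per column.
high_outliers <- sapply(outlier_demo, count_high_outliers)
print(high_outliers)
```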
The main feature in the data set is the “Quality” output variable. The goal is to determine which input variable(s) have the greatest influence on the perceived wine quality. I suspect Alcohol and pH are two of the leading contributors to perceived quality.
Notes about the data set indicate that several of the attributes may be correlated. I believe there’s likely a correlation between pH, Fixed Acidity, Volatile Acidity and Citric Acid. I also believe there may be a correlation between Alcohol and Residual Sugar.
The “quality” variable was used to create 4 Quality-Level factors for a “quality_lev_f” variable. The 4 factors are: “below avg.”, “avg.”, “above avg.” and “exceptional”. Below Average wines had quality levels from 0 to 4. Average wines had a quality level of 5. Above Average wines had quality levels of 6 and 7. Exceptional wines had quality levels from 8 to 10.
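The binning described above can be sketched with `cut()`. This is one possible implementation rather than the report's confirmed code; the break points follow the ranges given in the text, and the `quality` vector here is a small illustrative sample.

```r
# Illustrative quality scores covering every level seen in the data set (3 to 9).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> below avg., (4, 5] -> avg., (5, 7] -> above avg., (7, Inf) -> exceptional
quality_lev_f <- cut(
  quality,
  breaks = c(-Inf, 4, 5, 7, Inf),
  labels = c("below avg.", "avg.", "above avg.", "exceptional")
)

print(table(quality_lev_f))
```

Using open-ended breaks (`-Inf`, `Inf`) means scores of 0 or 10 would still be binned correctly if they ever appeared.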
Another variable called “fixed_to_volatile_acid_level” was created to investigate if the balance between Fixed and Volatile Acids plays any role in perceived quality. Likewise another variable called “sugar_to_sulfates_ratio” was created to investigate if the balance between Residual Sugar and Sulfates plays any role in perceived quality.
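The two ratio variables appear to be simple quotients of existing columns: dividing the first rows shown in the glimpse output reproduces the listed ratio values. A minimal sketch follows, assuming rounding to three decimals (inferred from the printed output, not confirmed) and using a hypothetical `wine_demo` data frame built from those first rows.

```r
# First three rows of the relevant columns, copied from the glimpse output.
wine_demo <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  residual.sugar   = c(20.70, 1.60, 6.90),
  sulphates        = c(0.45, 0.49, 0.44)
)

# Ratio of fixed to volatile acidity (rounding to 3 decimals is an assumption).
wine_demo$fixed_to_volatile_acid_level <-
  round(wine_demo$fixed.acidity / wine_demo$volatile.acidity, 3)

# Ratio of residual sugar to sulphates.
wine_demo$sugar_to_sulfates_ratio <-
  round(wine_demo$residual.sugar / wine_demo$sulphates, 3)

print(wine_demo[, c("fixed_to_volatile_acid_level", "sugar_to_sulfates_ratio")])
```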
The most unusual distribution thus far is the Chlorides distribution. The interquartile range is only about 0.01, and there are a significant number of outliers above it. Many other variables have outliers, so I expect to need to limit data based on quantiles or use a log10 transformation to reduce their effect. Most transforms performed will serve to eliminate distribution tails or reveal what sort of distribution a variable has.
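The two approaches mentioned above, limiting by quantiles and applying a log10 transformation, can be sketched as follows. The data are simulated, the 1st/99th percentile cut-offs are an illustrative assumption, and `chlorides_demo` is a hypothetical name.

```r
set.seed(42)

# Simulated right-skewed values with a few extreme points (not the real data).
chlorides_demo <- c(rlnorm(200, meanlog = log(0.045), sdlog = 0.3), 0.25, 0.30, 0.346)

# Option 1: keep only values between the 1st and 99th percentiles.
limits  <- quantile(chlorides_demo, probs = c(0.01, 0.99), names = FALSE)
trimmed <- chlorides_demo[chlorides_demo >= limits[1] & chlorides_demo <= limits[2]]

# Option 2: keep every value but compress the long right tail.
log_chlorides <- log10(chlorides_demo)

print(summary(trimmed))
print(summary(log_chlorides))
```

Trimming discards observations, while the log transform keeps them all but changes the scale, so the choice depends on whether the extreme values are of interest.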
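Based on the calculations from the linear regression models and their R2 values, a few key independent variables explain the most variance in Wine Quality. Excluding quality itself and the quality_lev_f factor derived from it, the top 5 variables and their adjusted R2 values, ranked in descending order, are: alcohol (0.18956), density (0.09414), chlorides (0.04388), volatile.acidity (0.03772) and total.sulfur.dioxide (0.03034).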
We’ll continue to focus on what relationships these variables have with wine quality.
## [1] "adjusted R2= 0.01272 fixed.acidity"
## [1] "adjusted R2= 0.03772 volatile.acidity"
## [1] "adjusted R2= -0.00012 citric.acid"
## [1] "adjusted R2= 0.00932 residual.sugar"
## [1] "adjusted R2= 0.04388 chlorides"
## [1] "adjusted R2= -0.00014 free.sulfur.dioxide"
## [1] "adjusted R2= 0.03034 total.sulfur.dioxide"
## [1] "adjusted R2= 0.09414 density"
## [1] "adjusted R2= 0.00968 pH"
## [1] "adjusted R2= 0.00268 sulphates"
## [1] "adjusted R2= 0.18956 alcohol"
## [1] "adjusted R2= 1 quality"
## [1] "adjusted R2= 0.83039 quality_lev_f"
## [1] "adjusted R2= 0.0206 fixed_to_volatile_acid_level"
## [1] "adjusted R2= 0.00648 sugar_to_sulfates_ratio"
There are too many variables at play in the above Correlation Chart, so let’s focus on the variables with the strongest correlation with Wine Quality. Alcohol has a notable positive or negative correlation with most other variables except Volatile Acidity. The strongest positive correlation with Wine Quality is that of Alcohol. The strongest negative correlation with Wine Quality is that of Density.
There’s a smaller alcohol range for Quality 3 and Exceptional wines. Above Average quality wines have the largest Alcohol content range. Average quality wines tended to have an Alcohol content range between 9 and 10. Exceptional Quality Wines had an ABV between 11 and 13.
The mean density appears to decrease as the wine quality increases. Average quality wines have the largest variability in density. Quality 3 and Exceptional wines have the smallest density range.
The Total Sulfur Dioxide level appears to decrease as the quality increases. All quality levels except Quality 9 have a large range of values; however, the maximum Total Sulfur Dioxide for Exceptional wines is much lower than that of lower quality wines.
The Chloride level decreases as the wine quality increases. The Chloride range between the different quality levels is pretty narrow. Despite its lower ranking in the Top 5 R-Squared values, this narrow range may indicate that the amount of Chlorides goes a long way in affecting a wine’s quality.
There are similar Volatile Acidity levels between Below Average and Exceptional Wines. Above Average Wines have a similar mean Volatile Acidity of around 0.25. Excluding Quality 3 and 9 Wines, all quality levels seem to have a similar range of values. However, I would assume that with more data points for Quality 3 and 9 Wines, we’d likely see a similarly large range.
There’s not a significant difference in pH values between wine quality levels with most values being between 3.1 and 3.25 pH. Average and Above Average wines had the largest pH ranges.
Exceptional Quality Wines and Quality 3 Wines have the smallest range of Citric Acid values. Average and Above Average Wines have the largest range of Citric Acid values.
The 1st and the 99th Percentile for pH values become significantly smaller for Exceptional Wines.
The Sugar to Sulfate Ratio range is narrower for the Exceptional quality wines.
Average quality wines tend to be between 9% and 11% ABV. ABV tends to increase linearly with wine quality.
The 99th Percentile for Total Sulfur Dioxide drops significantly for Exceptional wines and generally the mean Total Sulfur Dioxide appears to decrease for Above Average wines.
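The 1st and 99th percentiles per quality level referred to above could be computed by splitting a variable by the quality factor and applying `quantile` to each group. This is a sketch with simulated values in a hypothetical `wine_demo` data frame; the group means and spreads are made up for illustration.

```r
set.seed(7)

# Simulated Total Sulfur Dioxide values for two quality levels (not the real data).
wine_demo <- data.frame(
  quality_lev_f = factor(
    rep(c("below avg.", "exceptional"), each = 100),
    levels = c("below avg.", "exceptional")
  ),
  total.sulfur.dioxide = c(rnorm(100, mean = 130, sd = 60),
                           rnorm(100, mean = 126, sd = 30))
)

# split() makes one vector per quality level; quantile() is applied to each.
# The result has one column per level and rows named "1%" and "99%".
pct <- sapply(
  split(wine_demo$total.sulfur.dioxide, wine_demo$quality_lev_f),
  quantile,
  probs = c(0.01, 0.99)
)
print(round(pct, 1))
```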
The linear model indicates that the alcohol level generally increases as the quality increases. There are significantly more Quality 8 Exceptional Wines than there are Quality 9 Exceptional Wines. The Quality 9 wines are most likely to be around 13% ABV. Qualities 4-7 have a generally even distribution of alcohol levels ranging from 8.5% to 13%. The majority of the Above Average Quality wines are of Quality 6 at 11% ABV.
As density decreases and alcohol increases, so does the quality of wine.
Exceptional Wines are below 0.06 Chloride and are generally between 84 and 134 Total Sulfur Dioxide.
There’s a lot of similarity in Total Sulfur Dioxide and Volatile Acidity between Below Average and Exceptional Wines. Exceptional wines tended to have slightly more clustering in their values, but the difference from Below Average wines is negligible.
The most significant relationships related to quality were the relationship between Quality and Alcohol (0.436) and the negative relationship between Quality and Density (-0.307). The mean density appears to decrease as the wine quality increases and inversely, the alcohol level increases as the wine quality increases.
Three of the most interesting and unexpected relationships were between Total Sulfur Dioxide and Density (0.53), Total Sulfur Dioxide and Alcohol (-0.499), and Alcohol and Density (-0.78). The first may indicate that denser wines can hold more Total Sulfur Dioxide in solution and/or that the Total Sulfur Dioxide in solution contributes to density. The relationship between Total Sulfur Dioxide and Alcohol may indicate that Sulfur Dioxide dissipates as the Alcohol level increases with fermentation. The Alcohol and Density relationship seems to indicate that Alcohol decreases density, since alcohol is less dense than water and residual sugar.
The strongest relationship was the negative relationship between Alcohol and Density with a correlation value of -0.78. Overall Alcohol appears to correlate with many variables.
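Correlation values like those quoted above can be computed with `cor()`. The sketch below uses simulated data and the default Pearson method, which is an assumption, since the report does not state which coefficient was used. The variable names are illustrative.

```r
set.seed(3)

# Simulated alcohol and density values with a built-in negative relationship
# (not the real data set).
alcohol_demo <- runif(300, min = 8, max = 14)
density_demo <- 1.005 - 0.001 * alcohol_demo + rnorm(300, sd = 0.001)

# Pearson correlation coefficient (the default method for cor()).
r <- cor(alcohol_demo, density_demo)
print(round(r, 3))

# Passing a data frame instead returns the full correlation matrix.
print(round(cor(data.frame(alcohol = alcohol_demo, density = density_demo)), 3))
```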
Exceptional quality wines tend toward higher alcohol and lower density.
Below Average wines have a bit of clustering around the low end of Residual Sugars, but their Alcohol Range is wide. Above Average wines favor higher alcohol and generally have Residual Sugar values below 10.
All wine quality levels tend to have the same range of pH values. Average wines tend to cluster around 3.0 to 3.3 pH levels and 9% to 10% ABV. Above Average wines have the greatest range of values for both pH and Alcohol levels.
Most wines tend to cluster around low values (2 to 3) for Residual Sugar yet have an even range of pH values.
Most wines tend to cluster around low values (2 to 3) for Residual Sugar. Average and Above Average wines tend to have alcohol levels between 9 and 12.5.
Most wines tend to cluster around low values (2 to 3) for Residual Sugar and share a similar range of 50 to 200 for their Total Sulfur Dioxide.
Higher quality wines tended to be on the upper end of the alcohol range, with values between 11.40 and 12.18.
## white_wine$quality_lev_f: below avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.10 10.17 10.80 13.50
## ------------------------------------------------------------
## white_wine$quality_lev_f: avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## ------------------------------------------------------------
## white_wine$quality_lev_f: above avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.8 10.8 10.8 11.8 14.2
## ------------------------------------------------------------
## white_wine$quality_lev_f: exceptional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.65 12.60 14.00
## white_wine$quality_lev_f: below avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9960 1.0004
## ------------------------------------------------------------
## white_wine$quality_lev_f: avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0024
## ------------------------------------------------------------
## white_wine$quality_lev_f: above avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9912 0.9930 0.9935 0.9955 1.0390
## ------------------------------------------------------------
## white_wine$quality_lev_f: exceptional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0006
## white_wine$quality_lev_f: below avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.545 14.891 22.188 22.854 27.600 60.588
## ------------------------------------------------------------
## white_wine$quality_lev_f: avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.974 19.697 23.871 25.328 29.200 69.286
## ------------------------------------------------------------
## white_wine$quality_lev_f: above avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.987 21.481 27.600 29.043 35.263 90.000
## ------------------------------------------------------------
## white_wine$quality_lev_f: exceptional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.091 19.229 26.727 27.681 35.870 61.538
The minimum Residual Sugar to Sulfate Level is the highest for Quality 9 wines. The mean Residual Sugar to Sulfate Level is lowest for Quality 9 wines. The maximum Residual Sugar to Sulfate Level is lower for both the lowest quality wines (Quality 3 and 4) and the highest quality wines. The Residual Sugar to Sulfate Level range is smallest for Quality 9 wines.
## white_wine$quality_lev_f: below avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.628 3.056 5.294 10.387 15.819 35.000
## ------------------------------------------------------------
## white_wine$quality_lev_f: avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.29 3.81 14.81 15.61 23.48 64.57
## ------------------------------------------------------------
## white_wine$quality_lev_f: above avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.169 3.556 10.000 13.077 19.952 95.362
## ------------------------------------------------------------
## white_wine$quality_lev_f: exceptional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.368 3.825 8.757 13.113 21.297 48.333
Excluding wines of quality 3 and 4, the mean pH value increases as the quality of wine increases. The maximum pH level for wines of quality 8 and 9 drops significantly compared to qualities 3 through 7.
## white_wine$quality_lev_f: below avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.830 3.060 3.160 3.183 3.285 3.720
## ------------------------------------------------------------
## white_wine$quality_lev_f: avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.790 3.080 3.160 3.169 3.240 3.790
## ------------------------------------------------------------
## white_wine$quality_lev_f: above avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.190 3.196 3.290 3.820
## ------------------------------------------------------------
## white_wine$quality_lev_f: exceptional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.940 3.127 3.230 3.221 3.330 3.590
The lowest quality wine has over 3 times the maximum Total Sulfur Dioxide of the highest rated wine. Excluding Quality 5 wines, the maximum Total Sulfur Dioxide tended to decrease as the quality increased.
## white_wine$quality_lev_f: below avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 85.5 119.0 130.2 177.0 440.0
## ------------------------------------------------------------
## white_wine$quality_lev_f: avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 121.0 151.0 150.9 182.0 344.0
## ------------------------------------------------------------
## white_wine$quality_lev_f: above avg.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 105.0 129.0 133.6 159.0 294.0
## ------------------------------------------------------------
## white_wine$quality_lev_f: exceptional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.0 102.8 122.0 125.9 148.5 212.5
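The grouped summaries in this section have the layout that `by()` prints when `summary` is applied per factor level. A minimal sketch follows. The `white_wine` data frame here is a six-row toy stand-in (the name matches the output headers, the values are illustrative), so the numbers will not match the real output.

```r
# Toy stand-in for the data set: two quality levels, three wines each.
white_wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 12.0, 12.6, 11.0),
  quality_lev_f = factor(
    c("avg.", "avg.", "avg.", "exceptional", "exceptional", "exceptional"),
    levels = c("avg.", "exceptional")
  )
)

# Apply summary() to alcohol separately for each quality level.
alcohol_by_level <- by(white_wine$alcohol, white_wine$quality_lev_f, summary)
print(alcohol_by_level)
```

Swapping `white_wine$alcohol` for another column gives the same per-level summary for that variable.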
The most notable relationship was that of Wine Quality, Alcohol and Density. Exceptional quality wines tend toward higher alcohol and lower density, and Below Average wines tend toward lower alcohol and higher density.
The relationship between Quality, Alcohol and Residual Sugar was less definitive. Below Average wines have a bit of clustering around the low end of Residual Sugar, but their Alcohol range is wide. Above Average wines favor higher alcohol and generally have Residual Sugar values below 10.
There weren’t any significant relationships between Quality, pH and Citric Acid. Values for both Below Average and Exceptional wines were seemingly equally distributed.
There weren’t any significant relationships between Quality, Alcohol and pH. All wine quality levels tend to have the same range of pH values. Average wines tend to cluster around 3.0 to 3.3 pH levels and 9% to 10% ABV. Above Average wines have the greatest range of values for both pH and Alcohol levels.
In many cases there are quite a few similarities between the highest and lowest quality wines. The Average and Above Average wines offered the greatest range of values for most variables. There were interesting discrepancies between a few of the Quality 8 and 9 wines. Quality 9 wines tended to see higher alcohol levels, lower Total Sulfur Dioxide, a lower Sugar-To-Sulfate Ratio and an increased mean pH value.
The distribution above shows a trend of the Chloride level decreasing as the wine quality increases. The most significant decrease is from 0.040 for Quality 8 wines down to 0.0325 for Quality 9 wines. The difference in Chloride range between the different quality levels is seemingly negligible. However, despite Chlorides’ lower ranking in the Top 5 R-Squared values, this narrow range of approximately 0.025 may indicate that the amount of Chloride goes a long way in affecting a wine’s flavor and quality.
There’s a lot of similarity in Total Sulfur Dioxide and Volatile Acidity between Below Average and Exceptional Wines. However, Exceptional wines tended to have slightly more clustering in their values for both Total Sulfur Dioxide and Volatile Acidity. The Total Sulfur Dioxide values for Exceptional wines form a near-normal distribution, whereas the Volatile Acidity values form a looser normal distribution with higher variability. The Below Average wines have longer tails in their Total Sulfur Dioxide distribution, but otherwise the difference between the two groups is negligible.
As density decreases and alcohol increases, so does the quality of wine. The balance between Alcohol and Density is a key determinant of wine quality. Below Average wines tended to have a higher density and Exceptional wines had a lower density. The opposite was true for Alcohol levels: as Alcohol level increased, so did the wine quality. Below Average wines tended to have a broader range of Alcohol and Density values, whereas the Exceptional wines tended to cluster around the 12% to 13% ABV and 0.985 to 0.9925 Density levels. It is worth pointing out that it’s unclear whether wines received a higher quality rating because of their higher ABV, because of their lower density or because of a combination of both factors.
Some of the major struggles with this analysis were determining which variables to compare, determining how and when to transform or limit vectors, and determining which types of plots or functions conveyed information in the clearest manner. Also, determining which R packages provided which statistical and plotting functionality made the initial investigation progress slowly. Early on, one struggle was hoping to find a strong correlation between Wine Quality and certain variables (pH and Citric Acid specifically) when there simply wasn’t a strong correlation compared to other variables. Lastly, it was initially unclear which additional ratios or calculations would serve as useful additional column vectors to explore, until I re-examined the variable definitions and variable relationships.
The largest success came from using Adjusted R-Squared values and GGPairs to help isolate which variable relationships were worth exploring. These tools also highlighted whether a positive or negative correlation existed between the variables and helped explain interactions and trends visible on plots.
Grouping the numerical Quality vector into descriptive factors made it easier to group wines by quality using common quality descriptors and made it easier to see trends between different wine quality groupings.
Using Marginal Histograms was also immensely beneficial in communicating distributions, scatterplot clustering and patterns between the wine quality levels.
Lastly, realizing there was a strong correlation between Alcohol and Density helped drive a comparison between Alcohol, Density and Quality. This ultimately unveiled a strong relationship between the 3 variables, which makes sense in retrospect: as the density or sugars decrease, the amount of alcohol increases.
There’s plenty of room for further exploration and modeling. As with most analyses, a larger data set, specifically for Below Average and Exceptional wines, would tease out trends and smooth out averages. There was plenty of data for Average Quality wine but fewer data points for the tails of the Quality distribution. Also, more chemical compounds and variables from the wine such as grape varietal(s), age, price, brand, and region would be interesting to explore.